Adjusted Mutual Information (AMI) — adjusted_mutual_info_score#
Adjusted Mutual Information (AMI) compares two cluster labelings (two partitions of the same n samples) and corrects Mutual Information (MI) for agreement expected by chance.
AMI is an external clustering metric: it needs two labelings (e.g., ground truth classes and predicted clusters, or two different clustering algorithms).
Goals
Build intuition: contingency tables → MI → chance correction → AMI.
Implement AMI from scratch in NumPy (including Expected MI).
Visualize how AMI behaves (permutation invariance, noise, number of clusters).
Use AMI as a black-box objective to tune a simple model (logistic regression threshold / regularization).
Quick import
from sklearn.metrics import adjusted_mutual_info_score
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from scipy.special import gammaln
from sklearn.datasets import make_blobs, make_classification
from sklearn.metrics import (
accuracy_score,
adjusted_mutual_info_score as skl_adjusted_mutual_info_score,
)
from sklearn.model_selection import train_test_split
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(0)
1) The core object: the contingency table#
Let two labelings of the same n samples be:
True partition: (U = \{U_1,\dots,U_R\})
Predicted partition: (V = \{V_1,\dots,V_C\})
Define the contingency table (N\in\mathbb{N}^{R\times C}) by
[ N_{ij} = |U_i \cap V_j|. ]
Row and column sums are
[ a_i = \sum_j N_{ij} = |U_i|,\qquad b_j = \sum_i N_{ij} = |V_j|,\qquad n = \sum_i a_i = \sum_j b_j. ]
From counts we can define empirical probabilities:
[ p_{ij} = \frac{N_{ij}}{n},\qquad p_{i\cdot} = \frac{a_i}{n},\qquad p_{\cdot j} = \frac{b_j}{n}. ]
AMI is computed entirely from (N) (it ignores feature geometry).
def _relabel_to_integers(labels: np.ndarray) -> np.ndarray:
labels = np.asarray(labels)
_, relabeled = np.unique(labels, return_inverse=True)
return relabeled
def contingency_matrix_np(labels_true: np.ndarray, labels_pred: np.ndarray) -> np.ndarray:
# Contingency table N_ij = |U_i ∩ V_j| for two labelings.
true_int = _relabel_to_integers(labels_true)
pred_int = _relabel_to_integers(labels_pred)
n_true = int(true_int.max()) + 1
n_pred = int(pred_int.max()) + 1
contingency = np.zeros((n_true, n_pred), dtype=np.int64)
np.add.at(contingency, (true_int, pred_int), 1)
return contingency
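As a quick sanity check (assuming scikit-learn is available, whose `sklearn.metrics.cluster.contingency_matrix` builds the same table), here is a tiny made-up example:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([1, 1, 0, 2, 2, 2])

# Same construction as contingency_matrix_np: one np.add.at over (row, col) pairs.
ai = np.unique(labels_a, return_inverse=True)[1]
bi = np.unique(labels_b, return_inverse=True)[1]
table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
np.add.at(table, (ai, bi), 1)

assert np.array_equal(table, contingency_matrix(labels_a, labels_b))
```

Each row sum recovers a true cluster size and each column sum a predicted cluster size, as in the formulas above.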
2) Mutual Information (MI)#
The mutual information between partitions (U) and (V) is the mutual information between the discrete random variables “which cluster am I in?” under each labeling.
[ \mathrm{MI}(U,V) = \sum_{i=1}^{R}\sum_{j=1}^{C} p_{ij}\,\log\frac{p_{ij}}{p_{i\cdot}\,p_{\cdot j}}. ]
Equivalent (using counts) for (N_{ij} > 0):
[ \mathrm{MI}(U,V) = \sum_{i,j:\,N_{ij}>0} \frac{N_{ij}}{n}\,\log\left(\frac{n\,N_{ij}}{a_i\,b_j}\right). ]
The entropies are [ H(U) = -\sum_i p_{i\cdot}\log p_{i\cdot},\qquad H(V) = -\sum_j p_{\cdot j}\log p_{\cdot j}, ] and (\mathrm{MI}(U,V) = H(U) + H(V) - H(U,V)).
Units: with natural logs, MI is in nats (use (\log_2) for bits).
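A quick illustration of the units: the entropy of a fair coin is ln 2 nats, or exactly 1 bit, and dividing nats by ln 2 converts to bits.

```python
import numpy as np

p_coin = np.array([0.5, 0.5])
h_nats = float(-np.sum(p_coin * np.log(p_coin)))   # ln 2 ≈ 0.693 nats
h_bits = float(-np.sum(p_coin * np.log2(p_coin)))  # exactly 1 bit
assert np.isclose(h_bits, h_nats / np.log(2))      # bits = nats / ln 2
```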
Why MI needs adjustment#
MI tends to increase when you use more clusters (more degrees of freedom), even for unrelated/random labelings. AMI fixes that by subtracting the MI you’d expect by chance given the cluster sizes.
def entropy_from_counts(counts: np.ndarray) -> float:
counts = np.asarray(counts, dtype=np.float64)
n = counts.sum()
if n == 0:
return 0.0
p = counts[counts > 0] / n
return float(-np.sum(p * np.log(p)))
def mutual_info_from_contingency(contingency: np.ndarray) -> float:
# Mutual information in nats, computed from a contingency table.
contingency = np.asarray(contingency, dtype=np.int64)
n = int(contingency.sum())
if n == 0:
return 0.0
row_sums = contingency.sum(axis=1)
col_sums = contingency.sum(axis=0)
row_idx, col_idx = np.nonzero(contingency)
nij = contingency[row_idx, col_idx].astype(np.float64)
# MI = sum (nij/n) * log( (n*nij)/(ai*bj) )
numerator = nij * n
denominator = row_sums[row_idx] * col_sums[col_idx]
return float(np.sum((nij / n) * np.log(numerator / denominator)))
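The entropy identity MI = H(U) + H(V) − H(U,V) is easy to check numerically; the toy contingency table below uses arbitrary made-up counts:

```python
import numpy as np

N_toy = np.array([[3, 1], [0, 4]], dtype=float)  # hypothetical counts
n_toy = N_toy.sum()

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

h_u = entropy(N_toy.sum(axis=1) / n_toy)   # H(U) from row marginals
h_v = entropy(N_toy.sum(axis=0) / n_toy)   # H(V) from column marginals
h_uv = entropy(N_toy.ravel() / n_toy)      # joint entropy H(U, V)
mi_identity = h_u + h_v - h_uv

# Direct formula: sum over nonzero cells of p_ij * log(p_ij / (p_i. * p_.j))
pij = N_toy / n_toy
outer = np.outer(N_toy.sum(axis=1), N_toy.sum(axis=0)) / n_toy**2
mask = pij > 0
mi_direct = float(np.sum(pij[mask] * np.log(pij[mask] / outer[mask])))

assert np.isclose(mi_identity, mi_direct)
```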
3) Chance correction: Expected Mutual Information (EMI)#
AMI uses the expected MI under the null model:
the two labelings are random
but their cluster size distributions ({a_i}) and ({b_j}) are fixed (same marginals as observed)
This is the “generalized hypergeometric” model.
For a fixed pair of clusters ((i,j)), the intersection size (K = N_{ij}) follows a hypergeometric distribution with population size (n), (a_i) marked items, and (b_j) draws:
[ K \sim \mathrm{Hypergeometric}(n,\; a_i,\; b_j), ]
so
[ \mathbb{P}(K=k) = \frac{\binom{a_i}{k}\binom{n-a_i}{b_j-k}}{\binom{n}{b_j}},\qquad k\in[\max(0,a_i+b_j-n),\min(a_i,b_j)]. ]
Then
[ \mathrm{EMI} = \mathbb{E}[\mathrm{MI}(U,V)] = \sum_{i=1}^{R}\sum_{j=1}^{C}\sum_{k} \frac{k}{n}\,\log\left(\frac{n\,k}{a_i\,b_j}\right)\,\mathbb{P}(K=k). ]
This looks heavy, but for typical cluster counts it’s feasible (the inner sum only ranges up to (\min(a_i,b_j))).
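Before implementing EMI, it's worth checking the log-space hypergeometric PMF against scipy.stats.hypergeom (the marginals below are made up for illustration):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import hypergeom

n_total, a_i, b_j = 20, 8, 5  # hypothetical marginals: |U_i| = 8, |V_j| = 5, n = 20
k = np.arange(max(0, a_i + b_j - n_total), min(a_i, b_j) + 1)

# log C(m, j) = gammaln(m+1) - gammaln(j+1) - gammaln(m-j+1)
lf = gammaln(np.arange(n_total + 1, dtype=np.float64) + 1.0)
log_p = (
    (lf[a_i] - lf[k] - lf[a_i - k])
    + (lf[n_total - a_i] - lf[b_j - k] - lf[n_total - a_i - (b_j - k)])
    - (lf[n_total] - lf[b_j] - lf[n_total - b_j])
)

assert np.allclose(np.exp(log_p), hypergeom.pmf(k, n_total, a_i, b_j))
assert np.isclose(np.exp(log_p).sum(), 1.0)  # PMF sums to 1 over the full support
```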
def expected_mutual_info_from_marginals(row_sums: np.ndarray, col_sums: np.ndarray) -> float:
# Expected mutual information under the hypergeometric null (fixed marginals).
row_sums = np.asarray(row_sums, dtype=np.int64)
col_sums = np.asarray(col_sums, dtype=np.int64)
n = int(row_sums.sum())
if n == 0:
return 0.0
# log-factorials for stable combinatorics: log(n!) = gammaln(n+1)
log_factorial = gammaln(np.arange(n + 1, dtype=np.float64) + 1.0)
def log_comb(n_: int, k_: np.ndarray | int) -> np.ndarray:
k_arr = np.asarray(k_, dtype=np.int64)
return log_factorial[n_] - log_factorial[k_arr] - log_factorial[n_ - k_arr]
expected_mi = 0.0
for a in row_sums:
for b in col_sums:
k_min = max(0, a + b - n)
k_max = min(a, b)
if k_max < k_min:
continue
# Include k=0 in the probability normalization. (Its contribution to MI is 0.)
k = np.arange(k_min, k_max + 1, dtype=np.int64)
# Hypergeometric PMF in log-space:
# P(K=k) = C(a,k) * C(n-a, b-k) / C(n,b)
log_p = log_comb(a, k) + log_comb(n - a, b - k) - log_comb(n, b)
# Normalize to avoid under/overflow in exp.
m = float(log_p.max())
p = np.exp(log_p - m)
p = p / p.sum()
mask = k > 0
if not np.any(mask):
continue
k_pos = k[mask]
term = (k_pos / n) * np.log((n * k_pos) / (a * b))
expected_mi += float(np.sum(p[mask] * term))
return float(expected_mi)
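As a sanity check on the formula (a sketch with made-up cluster sizes): randomly permuting one labeling realizes exactly this fixed-marginals null, so the Monte Carlo mean of MI over random permutations should converge to EMI.

```python
import numpy as np
from scipy.stats import hypergeom

rng_mc = np.random.default_rng(123)
y_ref = np.repeat([0, 1, 2], [5, 7, 8])  # hypothetical cluster sizes, n = 20
sizes = np.bincount(y_ref)
n_ref = y_ref.size

def mi_nats(lu, lv):
    # MI from a contingency table built on the fly (nats).
    N = np.zeros((lu.max() + 1, lv.max() + 1))
    np.add.at(N, (lu, lv), 1)
    pij = N / n_ref
    outer = np.outer(N.sum(1), N.sum(0)) / n_ref**2
    m = pij > 0
    return float(np.sum(pij[m] * np.log(pij[m] / outer[m])))

# Analytic EMI (both marginals equal `sizes`), via scipy's hypergeometric PMF.
emi = 0.0
for ai in sizes:
    for bj in sizes:
        k = np.arange(max(1, ai + bj - n_ref), min(ai, bj) + 1)  # k = 0 contributes 0
        pmf = hypergeom.pmf(k, n_ref, ai, bj)
        emi += float(np.sum(pmf * (k / n_ref) * np.log(n_ref * k / (ai * bj))))

# Monte Carlo: MI between y_ref and random permutations of itself.
mc = float(np.mean([mi_nats(y_ref, rng_mc.permutation(y_ref)) for _ in range(4000)]))
assert abs(mc - emi) < 0.02
```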
def adjusted_mutual_info_score_np(
labels_true: np.ndarray,
labels_pred: np.ndarray,
*,
average_method: str = "arithmetic",
) -> float:
# Adjusted mutual information (AMI) in NumPy (+ gammaln for stability).
contingency = contingency_matrix_np(labels_true, labels_pred)
n = int(contingency.sum())
if n == 0:
return 1.0
row_sums = contingency.sum(axis=1)
col_sums = contingency.sum(axis=0)
h_true = entropy_from_counts(row_sums)
h_pred = entropy_from_counts(col_sums)
# Degenerate case: both partitions put everything in one cluster.
if h_true == 0.0 and h_pred == 0.0:
return 1.0
mi = mutual_info_from_contingency(contingency)
emi = expected_mutual_info_from_marginals(row_sums, col_sums)
if average_method == "arithmetic":
normalizer = 0.5 * (h_true + h_pred)
elif average_method == "min":
normalizer = min(h_true, h_pred)
elif average_method == "max":
normalizer = max(h_true, h_pred)
elif average_method == "geometric":
normalizer = float(np.sqrt(h_true * h_pred))
else:
raise ValueError(
"average_method must be one of: 'arithmetic', 'min', 'max', 'geometric'"
)
denom = normalizer - emi
if denom <= np.finfo(float).eps:
return 0.0
return float((mi - emi) / denom)
4) AMI definition (and why it’s useful)#
Given MI and its expectation under chance, AMI is
[ \mathrm{AMI}(U,V) = \frac{\mathrm{MI}(U,V) - \mathrm{EMI}}{\mathrm{Normalizer} - \mathrm{EMI}}. ]
In scikit-learn, the normalizer is based on (H(U)) and (H(V)) via average_method:
arithmetic: (\frac{H(U) + H(V)}{2}) (default)
min: (\min(H(U), H(V)))
max: (\max(H(U), H(V)))
geometric: (\sqrt{H(U)\,H(V)})
Key properties
Permutation invariant: renaming cluster IDs doesn’t change AMI.
Chance-corrected: unrelated random partitions have AMI ≈ 0.
Comparable across different numbers of clusters (unlike raw MI).
Can be negative if agreement is worse than chance (under this null model).
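The last property is easy to exhibit with a tiny constructed example: when each true cluster is split evenly across predicted clusters, MI is exactly 0 while EMI is positive, so AMI comes out negative (here −0.5 by direct calculation).

```python
from sklearn.metrics import adjusted_mutual_info_score

y_a = [0, 0, 1, 1]
y_b = [0, 1, 0, 1]  # each true cluster split evenly: MI = 0, EMI > 0
ami = adjusted_mutual_info_score(y_a, y_b)
print(ami)  # -0.5: agreement worse than chance under the fixed-marginals null
```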
5) Sanity check vs scikit-learn#
Below we compare our NumPy implementation to sklearn.metrics.adjusted_mutual_info_score on random labelings.
def _assert_matches_sklearn(n_trials: int = 200) -> None:
for average_method in ["arithmetic", "min", "max", "geometric"]:
for n in [10, 50, 200]:
for _ in range(n_trials):
y_true = rng.integers(0, 6, size=n)
y_pred = rng.integers(0, 8, size=n)
ours = adjusted_mutual_info_score_np(
y_true, y_pred, average_method=average_method
)
skl = skl_adjusted_mutual_info_score(
y_true, y_pred, average_method=average_method
)
if not np.isclose(ours, skl, atol=1e-10, rtol=0.0):
raise AssertionError(
f"Mismatch (n={n}, average_method={average_method}): "
f"ours={ours}, sklearn={skl}"
)
_assert_matches_sklearn(n_trials=50)
6) Toy examples: permutation invariance and errors#
Let’s build three predicted labelings from the same ground truth:
Perfect match up to a label permutation
A few mistakes (some points moved to the wrong cluster)
Completely random labels
We’ll look at the contingency tables and AMI values.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Perfect up to relabeling: 0→2, 1→0, 2→1
perm = {0: 2, 1: 0, 2: 1}
y_pred_permuted = np.array([perm[v] for v in y_true])
# A few mistakes
y_pred_noisy = y_pred_permuted.copy()
y_pred_noisy[[1, 4, 7]] = [0, 1, 2]
# Random labels with 3 clusters
rng_local = np.random.default_rng(1)
y_pred_random = rng_local.integers(0, 3, size=y_true.size)
examples = {
"perfect (permuted labels)": y_pred_permuted,
"noisy": y_pred_noisy,
"random": y_pred_random,
}
scores = {
name: adjusted_mutual_info_score_np(y_true, y_pred)
for name, y_pred in examples.items()
}
scores
{'perfect (permuted labels)': 1.0,
'noisy': 0.16502260912888653,
'random': 0.12559873432804056}
fig = make_subplots(
rows=1,
cols=3,
subplot_titles=[
f"{name}<br>AMI={scores[name]:.3f}" for name in examples.keys()
],
)
for col, (name, y_pred) in enumerate(examples.items(), start=1):
cont = contingency_matrix_np(y_true, y_pred)
fig.add_trace(
go.Heatmap(
z=cont,
coloraxis="coloraxis",
text=cont,
texttemplate="%{text}",
x=[f"pred {j}" for j in range(cont.shape[1])],
y=[f"true {i}" for i in range(cont.shape[0])],
),
row=1,
col=col,
)
fig.update_layout(
height=320,
width=950,
title_text="Contingency tables (counts)",
coloraxis=dict(colorscale="Blues"),
)
fig
7) How AMI behaves with label noise#
We start from the true labels, then randomly “corrupt” a fraction (p) of labels (assigning them to a random cluster). AMI should decrease smoothly from near 1 toward ~0.
We also show that AMI stays near 0 for random labels, even if we keep the number of clusters fixed.
X, y_true = make_blobs(
n_samples=400,
centers=4,
cluster_std=1.2,
random_state=0,
)
n = y_true.size
k_true = len(np.unique(y_true))
noise_grid = np.linspace(0.0, 0.9, 19)
amis = []
for p in noise_grid:
y_pred = y_true.copy()
n_flip = int(round(p * n))
flip_idx = rng.choice(n, size=n_flip, replace=False)
# Random reassignment (may coincidentally keep the same label for some points)
y_pred[flip_idx] = rng.integers(0, k_true, size=n_flip)
amis.append(adjusted_mutual_info_score_np(y_true, y_pred))
# Baseline: fully random labels with the same number of clusters
n_trials = 80
ami_random = np.array(
[
adjusted_mutual_info_score_np(y_true, rng.integers(0, k_true, size=n))
for _ in range(n_trials)
]
)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=noise_grid,
y=amis,
mode="lines+markers",
name="AMI (corrupted labels)",
)
)
fig.add_trace(
go.Scatter(
x=[noise_grid.min(), noise_grid.max()],
y=[ami_random.mean(), ami_random.mean()],
mode="lines",
name=f"AMI mean (random labels, k={k_true})",
line=dict(dash="dash"),
)
)
fig.add_trace(
go.Scatter(
x=[noise_grid.min(), noise_grid.max()],
y=[np.quantile(ami_random, 0.1), np.quantile(ami_random, 0.1)],
mode="lines",
line=dict(width=0),
showlegend=False,
)
)
fig.add_trace(
go.Scatter(
x=[noise_grid.min(), noise_grid.max()],
y=[np.quantile(ami_random, 0.9), np.quantile(ami_random, 0.9)],
mode="lines",
fill="tonexty",
line=dict(width=0),
name="random AMI 10%–90% band",
)
)
fig.update_layout(
title="AMI vs label noise",
xaxis_title="fraction of labels corrupted",
yaxis_title="adjusted mutual information (AMI)",
height=420,
)
fig
8) Why the adjustment matters: number of clusters bias#
Raw MI (and many related measures) can grow when the predicted labeling uses more clusters, even if it’s random.
Below we sample random predicted labels with different k and plot both MI and AMI. If the labels are unrelated to the ground truth, AMI should stay near 0 regardless of k, while MI tends to drift upward.
k_grid = np.arange(2, 31)
mi_by_k = []
ami_by_k = []
for k in k_grid:
y_pred = rng.integers(0, k, size=n)
cont = contingency_matrix_np(y_true, y_pred)
mi_by_k.append(mutual_info_from_contingency(cont))
ami_by_k.append(adjusted_mutual_info_score_np(y_true, y_pred))
fig = make_subplots(rows=1, cols=1, specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=k_grid, y=mi_by_k, mode="lines+markers", name="MI (nats)"), secondary_y=False)
fig.add_trace(go.Scatter(x=k_grid, y=ami_by_k, mode="lines+markers", name="AMI"), secondary_y=True)
fig.update_layout(
title="Random labels: MI increases with k, AMI stays near 0",
xaxis_title="k (number of predicted clusters)",
height=430,
)
fig.update_yaxes(title_text="MI (nats)", secondary_y=False)
fig.update_yaxes(title_text="AMI", secondary_y=True)
fig
9) AMI under chance: distribution#
Even under the null model, AMI fluctuates around 0 for finite n. We can visualize this by sampling many random predicted labelings.
k = k_true
n_trials = 250
ami_samples = np.array(
[
adjusted_mutual_info_score_np(y_true, rng.integers(0, k, size=n))
for _ in range(n_trials)
]
)
fig = px.histogram(
ami_samples,
nbins=30,
title=f"AMI distribution for random labels (k={k}, n={n})",
labels={"value": "AMI"},
)
fig.add_vline(x=float(ami_samples.mean()), line_dash="dash", line_color="black")
fig.update_layout(height=420)
fig
10) Using AMI as an optimization target (logistic regression)#
AMI is defined on discrete labels, so it’s not differentiable with respect to model parameters. That means:
You typically don’t train models by gradient descent on AMI directly.
You can use AMI as a black-box objective for:
choosing thresholds
selecting hyperparameters (regularization strength, features, etc.)
model selection
Below is a from-scratch logistic regression trained with log-loss, and then tuned by maximizing AMI on a validation set.
def sigmoid(z: np.ndarray) -> np.ndarray:
return 1.0 / (1.0 + np.exp(-z))
def standardize_fit(X: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
mean = X.mean(axis=0)
std = X.std(axis=0)
std = np.where(std == 0.0, 1.0, std)
return mean, std
def standardize_transform(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
return (X - mean) / std
def add_intercept(X: np.ndarray) -> np.ndarray:
return np.c_[np.ones((X.shape[0], 1)), X]
def train_logistic_regression_gd(
X: np.ndarray,
y: np.ndarray,
*,
lr: float = 0.2,
n_steps: int = 2000,
reg_strength: float = 0.0,
) -> np.ndarray:
# L2-regularized logistic regression with gradient descent.
# reg_strength corresponds to λ in: loss + (λ/2)||w||^2 (excluding intercept).
X = np.asarray(X, dtype=np.float64)
y = np.asarray(y, dtype=np.float64)
n_samples, n_features = X.shape
w = np.zeros(n_features, dtype=np.float64)
for _ in range(n_steps):
p = sigmoid(X @ w)
grad = (X.T @ (p - y)) / n_samples
# L2 penalty (do not regularize intercept)
grad[1:] += reg_strength * w[1:]
w -= lr * grad
return w
def predict_proba_logreg(X: np.ndarray, w: np.ndarray) -> np.ndarray:
return sigmoid(X @ w)
X, y = make_classification(
n_samples=800,
n_features=2,
n_redundant=0,
n_informative=2,
n_clusters_per_class=1,
class_sep=1.6,
random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.35, random_state=0, stratify=y
)
mean, std = standardize_fit(X_train)
X_train_std = standardize_transform(X_train, mean, std)
X_val_std = standardize_transform(X_val, mean, std)
X_train_ = add_intercept(X_train_std)
X_val_ = add_intercept(X_val_std)
reg_grid = np.array([0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])
threshold_grid = np.linspace(0.05, 0.95, 37)
ami_grid = np.zeros((reg_grid.size, threshold_grid.size))
acc_grid = np.zeros((reg_grid.size, threshold_grid.size))
for i, reg in enumerate(reg_grid):
w = train_logistic_regression_gd(
X_train_, y_train, lr=0.25, n_steps=2500, reg_strength=float(reg)
)
p_val = predict_proba_logreg(X_val_, w)
for j, thr in enumerate(threshold_grid):
y_hat = (p_val >= thr).astype(int)
ami_grid[i, j] = adjusted_mutual_info_score_np(y_val, y_hat)
acc_grid[i, j] = accuracy_score(y_val, y_hat)
best_idx = np.unravel_index(np.argmax(ami_grid), ami_grid.shape)
best_reg = float(reg_grid[best_idx[0]])
best_thr = float(threshold_grid[best_idx[1]])
best_ami = float(ami_grid[best_idx])
(best_reg, best_thr, best_ami)
(0.0, 0.625, 0.9249381865602224)
fig = go.Figure(
data=go.Heatmap(
z=ami_grid,
x=threshold_grid,
y=[str(v) for v in reg_grid],
colorscale="Viridis",
colorbar=dict(title="AMI"),
)
)
fig.update_layout(
title=f"Validation AMI for logistic regression (best: λ={best_reg}, thr={best_thr:.2f}, AMI={best_ami:.3f})",
xaxis_title="threshold",
yaxis_title="L2 regularization strength λ",
height=480,
)
fig
# Compare the AMI-optimal threshold to accuracy-optimal threshold for the best λ
w_best = train_logistic_regression_gd(
X_train_, y_train, lr=0.25, n_steps=2500, reg_strength=best_reg
)
p_val = predict_proba_logreg(X_val_, w_best)
ami_curve = np.array(
[
adjusted_mutual_info_score_np(y_val, (p_val >= t).astype(int))
for t in threshold_grid
]
)
acc_curve = np.array([accuracy_score(y_val, (p_val >= t).astype(int)) for t in threshold_grid])
thr_acc = float(threshold_grid[np.argmax(acc_curve)])
fig = make_subplots(rows=1, cols=1, specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=threshold_grid, y=ami_curve, name="AMI"), secondary_y=False)
fig.add_trace(go.Scatter(x=threshold_grid, y=acc_curve, name="Accuracy"), secondary_y=True)
fig.add_vline(x=best_thr, line_dash="dash", line_color="black")
fig.add_vline(x=thr_acc, line_dash="dot", line_color="gray")
fig.update_layout(
title=f"Threshold tuning at λ={best_reg} (dash: best AMI, dot: best accuracy)",
xaxis_title="threshold",
height=430,
)
fig.update_yaxes(title_text="AMI", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)
fig
11) Pros, cons, and when to use AMI#
Pros
Permutation invariant (cluster labels are arbitrary).
Corrects for chance: random labelings tend to score near 0.
Works when the two labelings have different numbers of clusters.
Symmetric: (\mathrm{AMI}(U,V)=\mathrm{AMI}(V,U)).
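The symmetry and mixed-cluster-count points are easy to verify directly (random labelings, so the AMI value itself will sit near 0):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng_demo = np.random.default_rng(0)
u = rng_demo.integers(0, 3, size=200)  # 3 clusters
v = rng_demo.integers(0, 7, size=200)  # 7 clusters: mismatched counts are fine

assert np.isclose(adjusted_mutual_info_score(u, v), adjusted_mutual_info_score(v, u))
```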
Cons / pitfalls
Requires two labelings; not an intrinsic clustering quality measure (can’t tell if clusters are “good” in feature space).
Based only on the contingency table; ignores geometry (two clusterings with the same contingency are indistinguishable).
Not differentiable → not a natural training loss; typically used for evaluation or black-box tuning.
Computing EMI can be more expensive than simpler indices (though usually fine for moderate n and cluster counts).
Interpretation is less immediate than, say, accuracy; AMI is best used comparatively.
Good use cases
Evaluating clustering against ground truth classes (external validation).
Measuring agreement between two clustering algorithms / runs (stability studies).
Hyperparameter selection for clustering when a reference labeling is available (e.g., choose k in KMeans by maximizing AMI).
Exercises#
Modify the noise experiment to keep the cluster sizes fixed while permuting membership. How does AMI behave?
Implement normalized_mutual_info_score from the same building blocks and compare it to AMI across different k.
Use AMI to select k for a simple KMeans implementation on make_blobs (treat the true labels as reference).
References#
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html
Vinh, Epps, Bailey (2010): Information Theoretic Measures for Clusterings Comparison